Safety-critical systems used in applications that demand high levels of 
dependability, efficiency, and fault-tolerance often use sequential logic 
circuits in their design and implementation. These safety-critical digital systems 
typically use latches, flip-flops, and other memory elements, which are 
prone to natural faults and single event upsets (SEUs) caused by radiation. 
As transistor dimensions continue to shrink, such faults are increasingly 
likely to cause subsystem failures. To 
design a reliable digital-based system, it is essential to develop new fault- 
tolerance approaches that are integrated into the design of sequential logic 
circuits. This work proposes a novel fault-tolerant approach based on the 
redundancy of the sequential logic circuit. It combines several design 
components: D flip-flop storage elements linked to a fault injection unit, 
duplicate modular redundancy, and data monitoring units with a switching 
circuit. Simulation results using a five-state Markov chain analysis model 
show that the proposed fault-tolerant system can achieve a system 
reliability of 0.99999998 for a fault detection coverage (C) equal to 
0.99999. Finally, we believe that using this new approach of fault-tolerance 
and redundancy would improve the dependability and reliability of next 
generation safety-critical applications. 

1. INTRODUCTION 

Fault-tolerance and reliability analysis play an essential role in the design and implementation of 
highly reliable and robust digital control systems [1]-[3]. Safety-critical control applications that use these 
electronic digital circuits, such as avionics, space, and industrial control applications, have become more 
vulnerable to faults stemming from different natural sources. Examples of these faults are intermittent faults, 
permanent single faults, transient single faults, multiple bit upsets (MBUs), and common cause faults (CCFs). 
All these faults may result from factors such as ionizing radiation, harsh environments, and electromagnetic 
interference, which can defeat traditional fault-tolerant techniques even at ground level [4]. Faults affect 
digital control systems differently depending on the severity of the environment in which the control system 
operates. Different fault-tolerant digital control systems have been developed in the literature to quickly 
identify the presence of a digital subsystem failure in the control system and diagnose its type. However, most 
of these systems achieved only low levels of dependability and reliability because of the limited capability of 
their fault tolerance mechanisms and the inclusion of additional hardware components that are not necessary 
for the operation of the control system. Although some traditional fault-tolerant techniques use hardware 
redundancy or reconfiguration strategies to mask or correct faults, their fault coverage (C) is too low to meet 
the high degrees of dependability and reliability required in these critical control systems. As an example of a 
computer architecture, the field programmable gate array (FPGA) consists of a two-dimensional array of 
logic blocks and flip-flops connected by interconnection routing blocks. The logic blocks perform 
combinational and sequential logic functions using look-up tables (LUTs) and the memory elements used to 
realize state machine control units. Combinational components such as LUTs and routing resources are 
vulnerable to permanent faults, which can be corrected either by reloading the bitstream file or by resetting 
the FPGA chip. In contrast, sequential components such as memory flip-flops are vulnerable to transient 
faults, which can be corrected by the next load of the configuration bitstream [5]. 

Several challenges stem from applying traditional fault-tolerant techniques when building 
reliable digital control systems. Firstly, the number of faults that can be tolerated before the whole system 
fails is limited by the number of redundant components available in the digital control system. Secondly, the 
redundancy management unit monitors the operation of the digital system, coordinates the redundant 
components, and detects defects in the working element; a failure of this unit may cause a whole-system 
failure even if there are no actual defects in the working system [6]. The major contribution of this 
research work is to overcome these architectural challenges by designing a novel fault-tolerant 
methodology that includes both static and dynamic redundant fault-tolerant systems. The approach consists 
of a sequential logic circuit, D flip-flop storage elements linked to a fault injection unit, duplicate modular 
redundancy, and data monitoring units. Simulation work is presented, and the results indicate 
that the approach achieves a robust fault-tolerant digital control system that can be used as a hardware 
platform for ultra-dependable and safety-critical control applications. 


2. PREVIOUS WORKS 

This section briefly reviews research on fault-tolerant digital systems and error detection methods. 
Figure 1 summarizes the methods that have been used to create different types of fault-tolerant digital 
embedded systems; each of them is discussed below. 


[Figure 1: diagram of methods leading to a fault-tolerant digital system: concurrent error detection (CED); finite state machines (FSMs); fault-secure system; circuit output mistake; Boolean difference error calculus (BDEC); error-detection and partial-error correction (EDPEC); full-error detection and correction (FEDC); feedback control loop based on a dynamic model] 

Figure 1. The different methods that were used for creating fault tolerant digital systems, from the literature 


Almukhaizim and Makris [7] described a methodology for creating fault-tolerant digital circuits that 
extends the concurrent error detection (CED) method. They used CED to detect errors as well as to provide 
error diagnosis and correction capabilities. A fault tolerance method for sequential logic circuits based on 
the concept of finite state machines (FSMs) was proposed in [8], [9]. The method relied on adding redundant 
equivalent states to protect a small number of states with a high probability of occurrence. All single errors 
occurring in the state variables of these frequently visited states, or in their combinational logic, were 
guaranteed to be tolerated by the redundant states. The method had a low area overhead because only a few 
states required protection, and it improved the fault tolerance of synthesized sequential circuits. 
Ostanin et al. [10] presented a fault-tolerant, low-overhead synchronous sequential circuit design based on a 
fault-secure system. Their scheme consisted of one fault-secure sequential circuit, one regular (unprotected) 
circuit, one checker, and one fairly simple exclusive OR (XOR) circuit. The dependability of the scheme was 
demonstrated for both single stuck-at faults at gate poles and transient or intermittent path delay faults, 
under the assumption that each fault appears only after the preceding one has vanished. Ban and Junior [11] 
established a trade-off between reliability and hardware area overhead by applying hardening methods to 
arithmetic circuits. Their work also suggested several fault-tolerant strategies in which the critical gates of 
arithmetic circuits were identified and ranked according to the consequences of an error at the circuit 
output. Given the area constraints of the design, these critical gates were hardened first. Output bits deemed 
essential to the system were given higher protection priority, which lowered the likelihood of catastrophic 
errors. The Boolean difference error calculus (BDEC) method previously proposed in the literature was 
extended in two ways: first, to account for the impact of reliability-enhancement strategies such as 
redundancy, and second, to cover sequential circuit parts [12]. Dug et al. [13] constructed and examined two 
techniques for creating fault-tolerant pipelined sequential and combinational circuits on an FPGA board: 
error-detection and partial error correction (EDPEC), and full-error detection and correction (FEDC). 
Shalini et al. [14] presented a selective triple modular redundancy (STMR) technique, arguing that hardware 
redundancy is a suitable approach to fault tolerance in digital circuits. To improve the timing behavior of 
synchronous sequential circuits, the output was determined precisely while disregarding the delay. 
The selection criteria for STMR included latency and failure probability. Simulation showed that the 
approach reduced hardware failures while applying the TMR technique only where necessary. In [15], a 
feedback control loop with an appropriate dynamic model was connected to a digital pipeline hardware 
system to lessen the impact of errors and faults on the output. The digital blocks whose operations were 
rewound were the data-path registers in the correction loops of an industrial robotic arm, to which correction 
factors were applied. The authors evaluated the cost and reliability of the technique and compared them with 
the standard TMR approach; their method used 30% fewer FPGA slices than TMR. The architectural 
design of a hybrid fault-tolerant processing core that uses error detection and correction against radiation 
faults was presented, analyzed, and simulated in [16]. Error correction codes were embedded among the five 
pipeline stages to identify run-time faults and operational errors. The timing simulation results indicated 
that the method is efficient in its use of digital hardware resources and that its software operation is 
continuously monitored by intelligent fault-tolerant techniques. 


3. THE PROPOSED RESEARCH METHOD 

The proposed fault-tolerant sequential logic system is designed to achieve high standards of 
dependability with respect to several fault models, including transient, intermittent, and permanent faults. In 
the proposed system, shown in Figure 2, three types of fault tolerance techniques are designed against these 
different types of faults. The basic sequential circuit component investigated in this paper is the D flip-flop 
(D F-F) memory element, a bi-stable component that stores a single bit, either "1" or "0", at a time. The D 
input signal propagates to the output Q on the rising edge of each synchronous clock pulse; at all other times 
the stored value is held. The complement of the output signal Q is called Q bar (written !Q below). The 
excitation behavior of the D flip-flop is shown in Table 1. 
Figure 2. The proposed fault-tolerant sequential logic system 

Table 1. D flip flop excitation 
Din    Clocking pulse    Current state Q    Next state Q 
X      0                 0                  0 
X      0                 1                  1 
0      rising edge       0                  0 
0      rising edge       1                  0 
1      rising edge       0                  1 
1      rising edge       1                  1 
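
The edge-triggered behavior of the D flip-flop can be sketched in Python. This is a minimal behavioral model for illustration, not the Simulink implementation used in this work; the class name `DFlipFlop` and its `tick` method are hypothetical names introduced here.

```python
class DFlipFlop:
    """Behavioral model of a rising-edge-triggered D flip-flop (illustrative only)."""

    def __init__(self, q: int = 0) -> None:
        self.q = q        # stored state Q
        self._clk = 0     # previous clock level, used to detect rising edges

    def tick(self, d: int, clk: int) -> int:
        """Apply inputs; Q takes the value of D only on a 0 -> 1 clock transition."""
        if self._clk == 0 and clk == 1:
            self.q = d
        self._clk = clk
        return self.q

    @property
    def q_bar(self) -> int:
        """Complemented output !Q."""
        return 1 - self.q


ff = DFlipFlop()
# Each pair is (D, clock level) applied in sequence.
trace = [ff.tick(d, clk) for d, clk in [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]]
print(trace)
```

The stored bit changes only on the second and fifth samples, where the clock goes from low to high; in between, Q holds its value regardless of D.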


To design a highly robust sequential fault-tolerant system that is resilient to natural faults and single 
event upsets, two types of fault tolerance techniques, with data monitoring units for the two output signals Q 
and !Q, were designed and embedded in the proposed system. For the first technique, an exclusive-NOR 
(XNOR) gate, called the first data monitoring unit for Q, compares the input of the D F-F with the next state 
Q. An XNOR output of 1 indicates that the D F-F is working normally and no fault has appeared, whereas an 
output of 0 indicates an error. A controlled switch driven by the XNOR output is embedded for correction: 
when the switch input equals 1, the value of Q passes through, and when it equals 0, the inverted value of Q 
passes through. Similarly, an XOR gate, called the first data monitoring unit for !Q, compares the input of 
the D F-F with the next state !Q. An XOR output of 1 indicates that the D F-F is working normally, and an 
output of 0 indicates a fault. A controlled switch driven by the XOR output passes !Q when its input equals 1 
and the inverted value of !Q when its input equals 0. Consequently, these two monitoring units can 
efficiently tolerate an unlimited number of transient and intermittent faults. 

For the second technique, two additional data monitoring units for the output signals Q and !Q of 
another memory device are proposed. These units use the concept of double modular redundancy 
(DMR) [17]-[19], with two XNOR gates and two further controlled switches that are responsible for 
detecting and correcting the effects of artificial and natural permanent faults. The idea is to use an additional 
spare D flip-flop: each XNOR gate compares the output of the switch that follows the first-stage monitoring 
gate with the output of the spare D flip-flop. If the XNOR output equals 1, no error is observed and the 
switch passes the output of the first-stage switch. If the XNOR output equals 0, an error is observed and the 
switch passes the output of the spare D flip-flop. In addition, to make the execution of the proposed design 
deterministic and synchronous, all the digital switches are controlled by a trigger signal, so that all outputs 
are compared at the same time. The excitation equation of the proposed digital circuit is given in (1): 


F = [X AND Y AND !Z AND !Q(t + 1)] OR [!Z AND Q(t + 1)]    (1) 
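
The monitoring and switching logic described in this section can be sketched as a few Python functions. This is a bit-level illustration under the assumption that, in a fault-free flip-flop, Q equals D after the clock edge; the function names (`monitor_q`, `monitor_q_bar`, `dmr_select`) are hypothetical and do not come from the Simulink model.

```python
def xnor(a: int, b: int) -> int:
    """Return 1 when both bits are equal, otherwise 0."""
    return 1 if a == b else 0


def xor(a: int, b: int) -> int:
    """Return 1 when the bits differ, otherwise 0."""
    return a ^ b


def monitor_q(d: int, q: int) -> int:
    """First monitoring unit for Q: pass Q when XNOR(D, Q) = 1, else pass NOT Q."""
    return q if xnor(d, q) else 1 - q


def monitor_q_bar(d: int, q_bar: int) -> int:
    """First monitoring unit for !Q: pass !Q when XOR(D, !Q) = 1, else pass its inverse."""
    return q_bar if xor(d, q_bar) else 1 - q_bar


def dmr_select(stage1: int, spare: int) -> int:
    """Second stage: keep the first-stage value if it agrees with the spare, else use the spare."""
    return stage1 if xnor(stage1, spare) else spare


d = 1
print(monitor_q(d, q=1))          # fault-free Q passes unchanged
print(monitor_q(d, q=0))          # flipped Q is inverted back by the switch
print(monitor_q_bar(d, q_bar=1))  # flipped !Q is inverted back by the switch
```

In this sketch a single bit flip on Q or !Q is always undone by the first-stage switch, which matches the claim that transient and intermittent faults on the outputs are tolerated.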


Figure 3(a) presents the timing diagram of the first monitoring unit (MU1) in normal operation, 
when no fault appears, obtained using MATLAB Simulink [20]. The input signals ‘X’, ‘Y’, and ‘Z’ are equal 
to 1, 1, and 0, respectively, and the data input of the D F-F is equal to 1. In this state, MU1 compares the 
input signal with the resulting output signal using the XNOR1 gate. Additionally, the D flip-flop input is 
checked against the complemented output using the XOR gate. If the outputs of both the XNOR and XOR 
gates are equal to ‘1’, no fault has appeared. Figure 3(b) presents the MU1 timing diagram when the ‘Q’ 
output signal of the D F-F is corrupted by a simulated fault. In this scenario, the output of the XNOR gate is 
equal to ‘0’, and MU1 corrects the false value by replacing it with the right value using a programmable 
digital switch. 
Figure 3. Timing diagram: (a) MU1 in normal operation with no fault injection and (b) MU1 at the first fault injection 

4. RESULTS AND DISCUSSION 

To evaluate the dependability and resilience of the proposed fault-tolerant sequential logic 
circuit and to quantify its reliability and safety, a Markov chain with five states was modeled, as shown in 
Figure 4 and Table 2 [21]. The reliability model has three operating states, one state for failing in a safe 
mode, and one state for failing in an unsafe mode. At any time, the system is in one of the five states: totally 
operational, first failing-operational, second failing-operational, failing in a safe mode, or failing in an 
unsafe mode. 


[Figure 4: state transition diagram; legible transition labels: 1 - 3λi1 Cf Δt, 1 - (2λi2 + μi1) Cf Δt, 1 - (λi3 + μi2) Cf Δt, and λi3 Cf Δt] 

Figure 4. Discrete-time Markov chain for the proposed fault-tolerant sequential logic system 


Table 2. The events describing the various states of the discrete-time Markov chain 

Event  Characterization 
X      Totally operational (the system's D flip-flop is fully operational, and one spare is available) 
Y      First failing-operational (the Q or !Q output of the D flip-flop is affected by a transient, intermittent, or permanent fault that is discovered by the XNOR gate, so the switch that follows the XNOR or XOR gate is used for repair) 
Z      Second failing-operational (the Q or !Q output of the D flip-flop is affected by a transient, intermittent, or permanent fault and then by another fault that is discovered by one of the two XOR gates, so one or both of the switches that follow the XOR gates are used for repair by replacing the faulty signal with the output signal of the spare D memory element) 
FS     Failing in a safe mode (the Q or !Q output of the spare D flip-flop is affected by a transient, intermittent, or permanent fault that cannot be repaired) 
FI     Failing in an unsafe mode (the output signal of the D memory element or the spare flip-flop fails without any detection) 


To analyze the reliability of the designed sequential fault-tolerant system using Markov chain 
models, it can be assumed that each sequential memory element obeys the exponential failure law and has a 
constant failure rate $\lambda$ [22]. The probability $P(t+\Delta t)$ that a fault-tolerant digital sequential 
circuit will fail at some future time $(t+\Delta t)$ can be written as: 

$$P(t+\Delta t) = 1 - e^{-\lambda \Delta t} \approx \lambda \Delta t \qquad (2)$$


where $\lambda$ is the failure rate and $P(t+\Delta t)$ is the probability that a fault-tolerant digital 
sequential circuit will fail at time $(t+\Delta t)$. 
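
The linear approximation in (2) holds only while the product of failure rate and interval is small. The following Python sketch compares the exact and linearized values; the failure rate and interval lengths are hypothetical values chosen for illustration, not parameters of the proposed system.

```python
import math


def failure_probability(lam: float, dt: float) -> float:
    """Exact 1 - exp(-lam * dt); expm1 keeps precision when lam * dt is tiny."""
    return -math.expm1(-lam * dt)


lam = 1e-6  # hypothetical failure rate in failures per hour
for dt in (1.0, 1e3, 1e5):  # interval lengths in hours
    exact = failure_probability(lam, dt)
    linear = lam * dt
    rel_err = (linear - exact) / exact
    print(f"dt={dt:g} h  exact={exact:.6e}  linear={linear:.6e}  rel.err={rel_err:.2e}")
```

The relative error of the linear form grows roughly as half of the product of rate and interval, so the approximation is very tight for short intervals and degrades as the interval lengthens.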


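The discrete-time state equations for the five states are: 

$$
\begin{aligned}
P_X(t+\Delta t) &= (1 - 3\lambda_{i1} C_f \Delta t)\,P_X(t) + \mu_{i1}\Delta t\,P_Y(t) + \mu_{i2}\Delta t\,P_Z(t) \\
P_Y(t+\Delta t) &= \left[1 - (2\lambda_{i2} + \mu_{i1}) C_f \Delta t\right] P_Y(t) \\
P_Z(t+\Delta t) &= \left[1 - (\lambda_{i3} + \mu_{i2}) C_f \Delta t\right] P_Z(t) \\
P_{FS}(t+\Delta t) &= \lambda_{i3} C_f \Delta t\,P_Z(t) + P_{FS}(t) \\
P_{FI}(t+\Delta t) &= 3\lambda_{i1}(1 - C_f)\Delta t\,P_X(t) + 2\lambda_{i2}(1 - C_f)\Delta t\,P_Y(t) + \lambda_{i3}(1 - C_f)\Delta t\,P_Z(t) + P_{FI}(t)
\end{aligned}
$$

The reliability can be computed from (3): 

$$R(t) = 1 - P_{FS}(t) - P_{FI}(t) = P_X(t) + P_Y(t) + P_Z(t) \qquad (3)$$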
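Rearranging the state equations and dividing by $\Delta t$ gives: 

$$
\begin{aligned}
\frac{P_X(t+\Delta t) - P_X(t)}{\Delta t} &= -3\lambda_{i1} C_f\,P_X(t) + \mu_{i1}\,P_Y(t) + \mu_{i2}\,P_Z(t) \\
\frac{P_Y(t+\Delta t) - P_Y(t)}{\Delta t} &= -(2\lambda_{i2} + \mu_{i1}) C_f\,P_Y(t) \\
\frac{P_Z(t+\Delta t) - P_Z(t)}{\Delta t} &= -(\lambda_{i3} + \mu_{i2}) C_f\,P_Z(t) \\
\frac{P_{FS}(t+\Delta t) - P_{FS}(t)}{\Delta t} &= \lambda_{i3} C_f\,P_Z(t) \\
\frac{P_{FI}(t+\Delta t) - P_{FI}(t)}{\Delta t} &= 3\lambda_{i1}(1 - C_f)\,P_X(t) + 2\lambda_{i2}(1 - C_f)\,P_Y(t) + \lambda_{i3}(1 - C_f)\,P_Z(t)
\end{aligned}
$$

with the state probability vectors 

$$
P_{system}(t+\Delta t) =
\begin{pmatrix} P_X(t+\Delta t) \\ P_Y(t+\Delta t) \\ P_Z(t+\Delta t) \\ P_{FS}(t+\Delta t) \\ P_{FI}(t+\Delta t) \end{pmatrix},
\qquad
P_{system}(t) =
\begin{pmatrix} P_X(t) \\ P_Y(t) \\ P_Z(t) \\ P_{FS}(t) \\ P_{FI}(t) \end{pmatrix}
$$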


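The two-dimensional state transition matrix of the Markov model, which relates the difference 
quotients above to the state probabilities, is: 

$$
\frac{P_{system}(t+\Delta t) - P_{system}(t)}{\Delta t} =
\begin{pmatrix}
-3\lambda_{i1} C_f & \mu_{i1} & \mu_{i2} & 0 & 0 \\
0 & -(2\lambda_{i2} + \mu_{i1}) C_f & 0 & 0 & 0 \\
0 & 0 & -(\lambda_{i3} + \mu_{i2}) C_f & 0 & 0 \\
0 & 0 & \lambda_{i3} C_f & 0 & 0 \\
3\lambda_{i1}(1 - C_f) & 2\lambda_{i2}(1 - C_f) & \lambda_{i3}(1 - C_f) & 0 & 0
\end{pmatrix}
P_{system}(t)
$$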
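
One way to read the matrix above is as the coefficient matrix A of the difference quotients, so that a single time step is P(t + Δt) = P(t) + Δt · A · P(t). The Python sketch below applies one such step using the entries exactly as they appear in the text. All numerical values are hypothetical and chosen only to make the arithmetic easy to follow; the function names are introduced here for illustration.

```python
def coefficient_matrix(lam1, lam2, lam3, mu1, mu2, cf):
    """Coefficient matrix with rows and columns ordered X, Y, Z, FS, FI."""
    return [
        [-3 * lam1 * cf, mu1, mu2, 0.0, 0.0],
        [0.0, -(2 * lam2 + mu1) * cf, 0.0, 0.0, 0.0],
        [0.0, 0.0, -(lam3 + mu2) * cf, 0.0, 0.0],
        [0.0, 0.0, lam3 * cf, 0.0, 0.0],
        [3 * lam1 * (1 - cf), 2 * lam2 * (1 - cf), lam3 * (1 - cf), 0.0, 0.0],
    ]


def step(p, a, dt):
    """One update P(t + dt) = P(t) + dt * A * P(t)."""
    return [
        p_i + dt * sum(a_ij * p_j for a_ij, p_j in zip(row, p))
        for p_i, row in zip(p, a)
    ]


# Hypothetical rates (per unit time) and coverage, for illustration only.
A = coefficient_matrix(lam1=1e-3, lam2=1e-3, lam3=1e-3, mu1=0.5, mu2=0.5, cf=0.9)
p = [1.0, 0.0, 0.0, 0.0, 0.0]          # start in the totally operational state X
p = step(p, A, dt=1.0)
reliability = p[0] + p[1] + p[2]        # R = PX + PY + PZ
print(p, reliability)
```

Repeating `step` advances the model in time; a smaller `dt` brings the result closer to the continuous-time solution of the same equations.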


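Letting the time interval $\Delta t$ tend to zero yields the following differential equations: 

$$
\begin{aligned}
\frac{d P_X(t)}{dt} &= -3\lambda_{i1} C_f\,P_X(t) + \mu_{i1}\,P_Y(t) + \mu_{i2}\,P_Z(t) \\
\frac{d P_Y(t)}{dt} &= -(2\lambda_{i2} + \mu_{i1}) C_f\,P_Y(t) \\
\frac{d P_Z(t)}{dt} &= -(\lambda_{i3} + \mu_{i2}) C_f\,P_Z(t) \\
\frac{d P_{FS}(t)}{dt} &= \lambda_{i3} C_f\,P_Z(t) \\
\frac{d P_{FI}(t)}{dt} &= 3\lambda_{i1}(1 - C_f)\,P_X(t) + 2\lambda_{i2}(1 - C_f)\,P_Y(t) + \lambda_{i3}(1 - C_f)\,P_Z(t)
\end{aligned}
$$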


Applying the Laplace transform gives the following equations: 

$$
\begin{aligned}
s\,P_X(s) - P_X(0) &= -3\lambda_{i1} C_f\,P_X(s) + \mu_{i1}\,P_Y(s) + \mu_{i2}\,P_Z(s) \\
s\,P_Y(s) - P_Y(0) &= -(2\lambda_{i2} + \mu_{i1}) C_f\,P_Y(s) \\
s\,P_Z(s) - P_Z(0) &= -(\lambda_{i3} + \mu_{i2}) C_f\,P_Z(s) \\
s\,P_{FS}(s) - P_{FS}(0) &= \lambda_{i3} C_f\,P_Z(s) \\
s\,P_{FI}(s) - P_{FI}(0) &= 3\lambda_{i1}(1 - C_f)\,P_X(s) + 2\lambda_{i2}(1 - C_f)\,P_Y(s) + \lambda_{i3}(1 - C_f)\,P_Z(s)
\end{aligned}
$$
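
As a worked illustration, the Laplace-domain equations for the states Y, Z, and FS do not depend on the other states and can be inverted directly. The sketch below assumes that the initial probability of state FS is zero and introduces the shorthand $a_Y$ and $a_Z$, which are not part of the original notation.

```latex
\begin{aligned}
a_Y &= (2\lambda_{i2} + \mu_{i1})\,C_f, \qquad a_Z = (\lambda_{i3} + \mu_{i2})\,C_f \\
P_Y(s) &= \frac{P_Y(0)}{s + a_Y}
  \;\Rightarrow\; P_Y(t) = P_Y(0)\,e^{-a_Y t} \\
P_Z(s) &= \frac{P_Z(0)}{s + a_Z}
  \;\Rightarrow\; P_Z(t) = P_Z(0)\,e^{-a_Z t} \\
P_{FS}(s) &= \frac{\lambda_{i3} C_f\,P_Z(0)}{s\,(s + a_Z)}
  \;\Rightarrow\; P_{FS}(t) = \frac{\lambda_{i3}}{\lambda_{i3} + \mu_{i2}}\,P_Z(0)\left(1 - e^{-a_Z t}\right)
\end{aligned}
```

The last line uses the standard transform pair $c / (s(s+a)) \leftrightarrow (c/a)(1 - e^{-at})$, and the coverage factor cancels because it appears in both $\lambda_{i3} C_f$ and $a_Z$.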


System reliability R(t) is typically defined as the probability that a logic circuit operates without 
failure during the period [0, t]. In addition, reliability is an evaluation metric for whether the expected 
service is delivered to the customer [23], [24]. Equation (4) gives the reliability. Safety is an extended 
concept of reliability: the safety of a logic circuit S(t) is defined as the probability that the circuit either 
performs its intended function correctly or fails in a safe mode during the period [0, t]. Equation (5) gives 
the safety. 


R(t) = PX(t) + PY(t) + PZ(t) (4) 
S(t) = PX(t) + PY(t) + PZ(t) + PFS(t) (5) 


The Stratix IV FPGA fabric, which is assumed to be the target implementation platform, has a 
failure rate of 38.1 FIT. FIT (failures in time) is a unit that gives the number of failures expected per 
10^9 hours of operation, so 1 FIT = 10^-9 failures/hour and λ = 38.1 × 10^-9 failures/hour. Altera's Stratix IV 
FPGA chip operates at a frequency of 50 MHz, so the mean time to repair (MTTR) for one clock cycle is 20 ns. 
Table 3 compares the state probabilities, i.e., the probabilities of being in the different states of Figure 4, 
for different fault detection coverage values. Additionally, the WINSTEM SURE analysis program [25] was 
used to model the reconfigurable behaviour of the proposed system. The SURE program is a reliability 
analysis tool developed by the National Aeronautics and Space Administration (NASA) to calculate failure 
probabilities. Table 4 lists the reliability and safety for different fault detection coverage values, and 
Figure 5 plots the fault detection coverage versus reliability. 
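
The unit conversions in this paragraph can be checked with a short Python sketch. The helper names are hypothetical, and the mean time to failure line relies on the exponential failure model assumed earlier, under which it equals the reciprocal of the failure rate.

```python
FIT_HOURS = 1e9  # one FIT corresponds to one failure per 10^9 device-hours


def fit_to_rate_per_hour(fit: float) -> float:
    """Convert a FIT value to a failure rate in failures per hour."""
    return fit / FIT_HOURS


def clock_period_ns(freq_hz: float) -> float:
    """Return the clock period in nanoseconds for a frequency in hertz."""
    return 1e9 / freq_hz


lam = fit_to_rate_per_hour(38.1)   # failure rate of the target FPGA fabric
mttf_hours = 1.0 / lam             # mean time to failure under the exponential model
period = clock_period_ns(50e6)     # duration of one clock cycle at 50 MHz

print(f"lambda = {lam:.3e} failures/hour")
print(f"MTTF   = {mttf_hours:.3e} hours")
print(f"period = {period:.1f} ns")
```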


Table 3. Probabilities for different states with different fault detection coverage values 

Coverage (C)   State X probability   State Y probability   State Z probability   State FS probability   State FI probability 
0.8            9.99002728523E-0001   7.96851758717E-0004   2.12798315173E-0007   1.89632981572E-0011    2.00206901247E-0004 
0.85           9.99002947007E-0001   8.46655055530E-0004   2.40229356768E-0007   2.27457729220E-0011    1.50157684996E-0004 
0.9            9.99003165494E-0001   8.96458359624E-0004   2.69322887327E-0007   2.70004784426E-0011    1.00106796057E-0004 
0.99           9.99003558777E-0001   9.86104325345E-0004   3.25880715104E-0007   3.59376382250E-0011    1.00109807011E-0005 
0.999          9.99003598106E-0001   9.95068923215E-0004   3.31832753489E-0007   3.69266929356E-0011    1.00110108110E-0006 
0.9999         9.99003602039E-0001   9.95965383015E-0004   3.32430919884E-0007   3.70265847616E-0011    1.00110138220E-0007 
0.99999        9.99003602432E-0001   9.96055028995E-0004   3.32490766149E-0007   3.70365838407E-0011    1.00110141230E-0008 


Table 4. Reliability and safety with different fault detection coverage values 
Coverage (C) Reliability results Safety results 


0.8 0.999849619985586768 0.999849620005 
0.85 0.999849842291886768 0.999849842315 
0.9 0.9998998931765113227 0.999899893204 
0.99 0.999989988983060 0.999989989019 
0.999 0.999998998861968489 0.99999899889889518193560 
0.9999 0.999999899852934884 0.9999998998899614687616 
0.99999 0.999999989951761149 0.9999999899887977328407 
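
Equations (4) and (5) can be applied directly to one row of Table 3 as a consistency check against Table 4. The Python sketch below uses the row for a coverage of 0.9 and assumes that the last two columns of Table 3 hold the probabilities of states FS and FI, since those column headers are not legible in the source.

```python
# State probabilities for C = 0.9, read from Table 3 (assumed order: X, Y, Z, FS, FI).
p_x = 9.99003165494e-1
p_y = 8.96458359624e-4
p_z = 2.69322887327e-7
p_fs = 2.70004784426e-11
p_fi = 1.00106796057e-4

reliability = p_x + p_y + p_z    # equation (4)
safety = reliability + p_fs      # equation (5)

print(f"R = {reliability:.15f}")
print(f"S = {safety:.12f}")
```

Under that assumption, both values agree with the entries listed for a coverage of 0.9 in Table 4 to the precision shown there.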
Figure 5. Fault detection coverage versus reliability 


5. CONCLUSION AND FUTURE WORK 

In this paper, we presented the architectural design and reliability analysis of a novel fault-tolerant 
sequential logic circuit for safety-critical digital applications. The primary objective is to overcome the 
deficiencies and faults that can affect the operation of latches and D flip-flops embedded in safety-critical 
sequential circuits. The advantage of the approach is that it tolerates an unlimited number of intermittent and 
transient faults. We demonstrated high levels of reliability by simulating fault injection campaigns on the 
output signals of memory storage elements. The results show that the proposed system can achieve a 
reliability and safety of 0.9998 for a fault detection coverage of 0.8 and a reliability of 0.99999998 for a 
coverage of 0.99999. For future work, we plan to focus on mathematical verification concepts that could be 
used to validate the operational execution of the data monitoring models. Furthermore, the proposed circuit 
could operate in critical environments that generate potential CCFs by adding a hybrid fault-tolerant 
mechanism with spare sequential components. Finally, generating the hardware description language (HDL) 
code using the MathWorks Simulink-based HDL Coder and synthesizing the proposed circuit for real-time 
operation is also left as future work. 